2017-03-10

算法设计与分析[0003] 一道阿里巴巴面试题（2017）

　本文通过一道阿里面试题（下图），说说关于该题的字符串最长子串的查找问题。

转化问题1：求一个字符串中连续出现次数最多的子串

#include "stdio.h"
#include "string.h"
 
int count = 0;
char sub_str[2000];
 
void find_substr(char *str) {
     int str_len = strlen(str);
     int i, j, k;
     int tmp_cnt = 0;
 
     for (i = 0; i < str_len; i++) {
         for (j = i+2; j < str_len; j++) {
             int n = j-i;                                                     //sub string length
             tmp_cnt = 1;
             if (strncmp(&str[i], &str[j], n) == 0) {  //compare n-lengths strings
                 tmp_cnt++;                                             //they are equal, so add count
                 for (k = j+n; k < str_len; k += n) {      //consecutive checking
                     if (strncmp(&str[i], &str[k], n) == 0) {
                         tmp_cnt++;
                     }
                     else {
                         break;
                     }
                 }
                 if (count < tmp_cnt) {
                     count = tmp_cnt;
                     memcpy(sub_str, &str[i], n);        //record the sub string
                 }
             }
         }
     }
}
 
int main(){
     char str[2000];
     scanf("%s", str);
     find_substr(str);
     printf("%s\n", sub_str);
     return 0;
}

分析：

对于连续出现的子串，我们可以以候选子串长度作为step
通过指定step长度的滑动窗口，统计候选子串出现的次数，并计数
维护出现次数最多的子串
转化问题2：求一个字符串中出现次数最多的子串(不一定是连续的)
　假设存在一个长度为 N 的子串 S 出现的次数最多。那么它具有哪些特点呢？
S的任一子串的出现次数不少于 S 的出现次数
S中不会出现重复的子串字符
S中不会出现重复的字符

组成 S 的每一个字符、每一个子串的出现次数都和 S 一样
　“S 中不会出现重复的字符”，“组成 S 的每一个字符、每一个子串的出现次数都和 S 一样”！有了这个结论，问题就简单了。

// mostTimesSubstring.cpp
#include <iostream>
#include <cstring>
#include <deque>
#include <cstdio>
 
#define BUFFSIZE 1024
 
using std::cout;
using std::endl;
using std::string;
using std::deque;
 
typedef deque<string> strlist;
 
const int npos = -1;
const string ignoreChars(" /t/n/r");
 
inline bool IgnoreChar(char c){
    return (ignoreChars.find(c) != npos);
}
 
// return the max count
unsigned TextSummary(const string& text, unsigned usecount[], int Num4Chars){
    unsigned max_count, i;
 
    memset(usecount, 0, Num4Chars*sizeof(unsigned));
    for(i = 0; i < text.length(); usecount[unsigned(text[i++])]++);
 
    for(max_count = i = 0; i < 256; i++){
        if(IgnoreChar(i)) continue;
        if(usecount[i] > max_count) max_count = usecount[i];
    }
 
    return max_count;
}
 
// check whether current substring splicing one more char also reach max count
char StringTryGrowthOneChar(const string& text, const string& str, int maxcount, unsigned* usecount){
    int count;
    string::size_type pos;
    char c = 0;
 
    pos = text.find(str);
    if(pos == npos)
        return 0;
 
    // not the max count char
    c = text[pos + str.length()];
    if(usecount[unsigned(c)] < maxcount)
        return 0;
 
    // make sure every char in this growing substring also reach max count
    for(count = 0; pos + str.length() + 1 < text.length(); pos += str.length()){
        pos = text.find(str, pos);
        if(pos == npos) break;
        if(c != text[pos + str.length()]) return 0;
    }
 
    return c;
}
 
void PrintResult(const strlist& result){
    strlist::const_iterator citer;
 
    cout << endl << "The result substrings :" << endl;
    cout << "-------------------------------------" << endl;
    for(citer = result.begin(); citer != result.end(); citer++){
        // substring should longer than 2 chars
        if((*citer).length() > 1)
            cout << '"' << *citer << '"'<<endl;
    }
    cout << "-------------------------------------" << endl;
    cout << "Total : " << result.size() << endl;
}
 
int main( ){
    unsigned usecount[256];
    char buffer[BUFFSIZE], c;
    unsigned count, i;
    string text;
    strlist result;
 
    while(!feof(stdin)){
        if(fgets(buffer, sizeof(buffer), stdin))
            text = buffer;
 
        // Count the number of occurrences of characters
        count = TextSummary(text, usecount, sizeof(usecount)/sizeof(*usecount));
        cout << "Max count :" << count << endl;
 
        if(1 >= count){
            cout << "No longest substring!" << endl;
            continue;
        }
 
        // result holds the substring reach max count
        for(i = 0; i < sizeof(usecount)/sizeof(*usecount); i++){
            if(usecount[i] == count)
                result.push_back(string(1, char(i)));
        }
 
        // substring growing for more substrings
        for(strlist::iterator iter = result.begin(); iter != result.end(); iter++){
            c = StringTryGrowthOneChar(text, *iter, count, usecount);
            if(c)
                result.push_back(*iter + string(1, c));
        }
 
        PrintResult(result);
        result.clear();
    }
 
    return 0;
}

编译构建：

1	$ g++ -o mostTimesSubstring mostTimesSubstring.cpp

算法分析：

找到文本中的出现次数最高的单个字符组成的子串，放入一个队列中
从队列的头部开始，对每一个子串 S 进行处理，找到文本中该子串出现的任意一个位置 P，判断文本中紧随 S 之后的字符 C 是否的出现次数是最多的
- 如果 C 的出现次数不是最多的，结束。
- 如果 C 的出现次数是最多的，搜索文本中的每一个 S 并判断紧随其后的字符是否是 C
  - 如果文本中的每一个 S 之后都存在字符 C ，将 S + C 生成的子串放入结果集中
  - 如果文本中出现 S 之后的字符不是 C ，结束。
- 如此，直至到达队列尾。

回到这道面试题

该题统计的是：子串出现次数与子串长度的乘积，问题是，是否这个乘积的最大值总是：（1）出现次数最多的；（2）长度最长的，显然不是
问题分析1：我们需要穷举所有子串并计数各自出现的次数，最终获取乘积最大的子串
问题分析2：能否不穷举，对上述算法进行变形？

本文标题:算法设计与分析[0003] 一道阿里巴巴面试题（2017）

文章作者:Gary

发布时间:2017-03-10, 14:00:22

最后更新:2020-03-28, 12:37:22

原始链接:http://durant35.github.io/2017/03/10/Algorithms_An Alibaba Interview Question/

转化问题1：求一个字符串中连续出现次数最多的子串

转化问题2：求一个字符串中出现次数最多的子串(不一定是连续的)

回到这道面试题